Table Metadata: Headers, Augmentations and Aggregates

Authors

  • George Nagy
  • Mukkai Krishnamoorthy
Abstract

A sample of 200 web tables was interactively converted into layout-independent Augmented Wang Notation (AWN) using the Table Abstraction Tool (TAT). The resulting XML ground-truth files list for each table (1) cell contents, (2) relationships between the hierarchical column and row headers and the value/content/data cells, (3) designators for aggregates like totals and averages, and (4) ancillary information (augmentations) represented by table titles and captions, footnotes, and unit indicators. On average, these tables have 585 cells, 8.8 footnotes, and 1.4 rows of aggregates. They differ widely in number of cells, Wang dimensionality, and MHTML and AWN/XML file sizes. Even though TAT automates much of the repetitive work, interactive ground-truthing took on average four minutes per table. The collected ground truth is offered to the research community for experimentation on automated table processing and for realistic pseudo-random generation of table data.
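The four kinds of information listed in the abstract can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the class names, field names, and sample values are hypothetical and invented for illustration, and they do not reproduce the actual AWN/XML schema or TAT output.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataCell:
    """One value cell, addressed by header paths instead of grid position."""
    value: str                      # (1) cell content
    row_path: tuple[str, ...]       # (2) hierarchical row-header path
    col_path: tuple[str, ...]       # (2) hierarchical column-header path
    aggregate: str | None = None    # (3) e.g. "total" or "average"; None for plain data


@dataclass
class TableRecord:
    """Hypothetical layout-independent record for one table."""
    title: str = ""                                      # (4) augmentation
    footnotes: list[str] = field(default_factory=list)   # (4) augmentation
    units: list[str] = field(default_factory=list)       # (4) augmentation
    cells: list[DataCell] = field(default_factory=list)

    def lookup(self, row_path, col_path):
        """Return the value at the given header paths, or None if absent."""
        for cell in self.cells:
            if cell.row_path == tuple(row_path) and cell.col_path == tuple(col_path):
                return cell.value
        return None

    def aggregates(self):
        """Return only the cells marked as aggregates."""
        return [cell for cell in self.cells if cell.aggregate is not None]


# Invented sample data, for illustration only.
record = TableRecord(
    title="Example population table (invented)",
    units=["thousands"],
    footnotes=["Figures are illustrative only."],
    cells=[
        DataCell("12", ("Region A", "Urban"), ("2009", "Male")),
        DataCell("15", ("Region A", "Urban"), ("2009", "Female")),
        DataCell("27", ("Region A", "Urban"), ("2009", "Total"), aggregate="total"),
    ],
)
```

Addressing cells by header paths rather than by row and column index is what makes such a record independent of the original layout: `record.lookup(("Region A", "Urban"), ("2009", "Male"))` finds the same value however the source table was arranged.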


Similar resources

Interactive Conversion of Web Tables

Two hundred web tables from ten sites were imported into Excel. The tables were edited as needed, then converted into layout-independent Wang Notation using the Table Abstraction Tool (TAT). The output generated by TAT consists of XML files to be used for constructing narrow-domain ontologies. On average, each table required 104 seconds for editing. Augmentations like aggregates, footnotes, t...


A Flexible Table Parsing Approach

Relational data is often encoded in tables. Tables are easy to read by humans, but difficult to interpret automatically. In cases where table layout cues are not obtainable (missing HTML tags) or where columns are distorted (by copying from a spreadsheet to text) previous table extraction approaches run into problems. This paper introduces a novel table parsing approach. Our approach is based o...


Abstractive Tabular Dataset Summarization via Knowledge Base Semantic Embeddings

This paper describes an abstractive summarization method for tabular data which employs a knowledge base semantic embedding to generate the summary. Assuming the dataset contains descriptive text in headers, columns and/or some augmenting metadata, the system employs the embedding to recommend a subject/type for each text segment. Recommendations are aggregated into a small collection of super t...


Table Header Detection and Classification

In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is an old field of research. However, due to the diversity of table...


Annotating Table Headers Based on Semantic Web Resources

Tables offer an often-used way to represent information for the human reader. But as long as those tables are not annotated with semantic information, they are meaningless to machines. In this work a methodology is proposed to annotate the headers of table columns with semantic types by creating a ranking of possible column headers based on the column cells. In the performed experiments on 10 i...




Publication date: 2010